Abstract. We describe a modular system for generating sentences from formal
definitions of underlying linguistic structures using domain-specific
languages. The system uses Java as its general-purpose language, Prolog for
lexical entries and custom domain-specific languages based on Functional
Grammar and Functional Discourse Grammar notation, implemented using the
ANTLR parser generator. We show how linguistic and technological parts can
be brought together in a natural language processing system and how
domain-specific languages can be used as a tool for consistent formal
notation in linguistic description.



1 Motivation and Overview 

This paper describes a system for generating sentences using domain-specific
languages (DSL; see section 3) for the formal representation of underlying
linguistic structures and lexical entries.¹ The DSL implemented for
underlying structures is based on representations in Functional Grammar
(FG; Dik 1997). The grammar module and the lexicon are based on a revised
and extended version of the implementation described in Samuelsdorff (1989).
To evaluate the flexibility of our approach, we also implemented
domain-specific languages for formal representations in Functional Discourse
Grammar (FDG), which, like FG, explicitly demands "formal rigor" (Hengeveld
and Mackenzie 2006:668). Creating a computational implementation is a
valuable evaluation tool for linguistic theories in general (cf. Bakker
1994:4). By actually generating linguistic expressions from representations
used in a linguistic theory, an implementation can be used to evaluate and
improve representational aspects of the theory.

¹ The described implementation and infrastructure for collaborative
development are available online (http://fgram.sourceforge.net).
2 System Architecture

The system consists of individual, exchangeable modules for creating an
underlying structure, processing that input and generating a linguistic
expression from the input (cf. Fig. 1 for an overview of the system
architecture). In the input module an underlying structure is created,
edited and evaluated. The input is sent to the processing module, which
communicates with the grammar module. When the generation is done, the user
interface displays either the result of the evaluation, namely the
linguistic expression generated from the input, or an error message (cf.
Fig. 3 for sample output of the console-based implementation of the input
module). The system architecture can be characterized as a three-tier
architecture (Eckerson 1995).

Such a modular approach has two main advantages. First, modules can be
exchanged; for instance the input module is implemented both as a desktop
application and as a web-based user interface with the actual processing
happening on a server (implemented using Java Server Pages on a Tomcat
servlet container). Second, individual modules of our system can be
combined with other natural language processing (NLP) components and so be
reused in new contexts.
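
The exchangeability of modules can be illustrated with a minimal Java
sketch. The class name GenerationPipeline and the modelling of each module
as a function from string to string are assumptions made for illustration;
they are not taken from the described implementation.

```java
import java.util.function.UnaryOperator;

/**
 * Sketch of the processing and grammar tiers as exchangeable functions.
 * All names are hypothetical and do not come from the described system.
 */
final class GenerationPipeline {
    private final UnaryOperator<String> processing; // DSL text -> grammar input
    private final UnaryOperator<String> grammar;    // grammar input -> expression

    GenerationPipeline(UnaryOperator<String> processing,
                       UnaryOperator<String> grammar) {
        this.processing = processing;
        this.grammar = grammar;
    }

    /** Returns the generated expression, or an error message if a module rejects the input. */
    String evaluate(String underlyingStructure) {
        try {
            return grammar.apply(processing.apply(underlyingStructure));
        } catch (IllegalArgumentException e) {
            return "Error: " + e.getMessage();
        }
    }
}
```

Because the pipeline only depends on the two function types, a desktop or
web-based input module could call the same evaluate method, and either tier
could be replaced without changing the other.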



3 Domain-Specific Languages 

The usage of languages which are tailored for a specific domain
(domain-specific languages, DSL) has a long tradition in computing (e.g. for
configuration files) and has been acknowledged as a best practice in recent
years (cf. Hunt and Thomas 1999). Domain-specific languages are also a
central aspect of a programming paradigm called language-oriented
programming (cf. Ward 2003).

Our system uses Java as a general-purpose language, Prolog as a DSL for
lexical entries and expression rules (see section 4.3, cf. Macks 2002 for a
similar usage of Prolog), and a custom DSL for describing underlying
structures, implemented using ANTLR, a tool for defining and processing
domain-specific languages (Parr 2007, http://www.antlr.org/). While e.g. in
the domain of banking a DSL might describe credit rules, a linguist working
with a model like FDG uses a DSL for linguistic description, in particular
for the formal notation of underlying linguistic structures. With ANTLR,
the form of the DSL is defined using a notation based on the Extended
Backus-Naur Form (EBNF, cf. Wirth 1977, see Fig. 6 for the format used by
ANTLR). From that grammar definition a Java parser that can process the DSL
is automatically generated by ANTLR.

Fig. 1: System architecture



4 Linguistic Structures 

4.1 Structures in Functional Grammar 

The processing module's input format is a representation of the linguistic
expression to be generated (cf. Fig. 2 and 3); its form is based on the
representation of underlying structures given in Dik (1997). The processing
module parses the input entered by the user and creates an internal object
representation (cf. Fig. 4). This is then converted into the output format
of the processing module, a Prolog representation of the input (cf. Fig. 5),
which is used by the grammar module (cf. section 4.3). The mapping of the
values used in the Prolog representation to those used in the input
structure (like m to plural) is done in a Java properties file and therefore
allows for configuration of the formal aspects of the input (which uses e.g.
m) independently of the implementation code that generates the expression
(which uses e.g. plural).
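
As a minimal sketch of such a mapping, the following Java code loads
properties from a string and looks up notation symbols. The class name
NotationMapping, the helper methods and the exact keys are assumptions for
illustration; only the pairs m/plural, d/def and i/indef are suggested by
the figures in this paper, and the layout of the actual properties file may
differ.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper that maps input notation symbols to internal values. */
final class NotationMapping {
    /** Assumed file content; the real properties file may use other keys. */
    static final String SAMPLE = "m=plural\nd=def\ni=indef\n";

    /** Parses properties-file text into a Properties object. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new IllegalStateException(e);
        }
        return props;
    }

    /** Translates a notation symbol and rejects symbols without a mapping. */
    static String toInternal(Properties mapping, String symbol) {
        String value = mapping.getProperty(symbol);
        if (value == null) {
            throw new IllegalArgumentException("Unmapped symbol: " + symbol);
        }
        return value;
    }
}
```

With this separation, changing the notation (for instance using pl instead
of m) would only require editing the mapping, not the generating code.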



(Past e:
    (d1x: 'man' [N]:
        (Past Pf e: 'give' [V]
            (d1x: 'mary' [N])Ag
            (dmx: 'book' [N]: 'old' [A])Go
            (x: 'man' [N])RecSubj
        )
    )
    (d1x: 'John' [N])0
)

Fig. 2: A nested underlying structure in Functional Grammar based on Dik
(1997), which is parsable by the generated ANTLR v2 parser (represents
John is the man who was given the old book by Mary)






» (e: 'love' [V]: (x: 'man' [N])AgSubj (x: 'woman' [N])GoObj)
The man loves the woman

» (Past pf e: 'give' [V]:
    (dmx: 'farmer' [N]: 'old' [A])AgSubj
    (imx: 'duckling' [N]: 'soft' [A])GoObj
    (dmx: 'woman' [N]: 'young' [A])Rec)
The old farmers had given soft ducklings to the young women

Fig. 3: Sample output of the console-based implementation of the input
module: a linguistic structure conforming to Functional Grammar notation is
entered at the prompt (»), for which the linguistic expression is generated
using the linguistic knowledge in the grammar module



Node x1:Predicate
    lexeme = give
    tense = past

    Node x2:Term
        lexeme = farmer
        modif = old

    Node x3:Term
        lexeme = duckling
        modif = soft

    Node x4:Term
        lexeme = woman
        modif = young

Fig. 4: Internal representation of the second structure in Fig. 3
(represents The old farmers had given soft ducklings to the young women): a
tree of Java objects (in UML notation)
node(x1, 0).
node(x2, 1).
node(x3, 1).
node(x4, 1).

prop(clause, illocution, decl).
prop(clause, type, mainclause).

prop(x1, type, pred).
prop(x1, tense, past).
prop(x1, perfect, true).
prop(x1, progressive, false).
prop(x1, mode, ind).
prop(x1, voice, active).
prop(x1, subnodes, [x2, x3, x4]).
prop(x1, lex, 'give').
prop(x1, nav, [V]).
prop(x1, det, def).

prop(x2, type, term).
prop(x2, role, agent).
prop(x2, relation, subject).
prop(x2, proper, false).
prop(x2, pragmatic, null).
prop(x2, num, plural).
prop(x2, modifs, [old]).
prop(x2, lex, 'farmer').
prop(x2, nav, [N]).
prop(x2, det, def).

prop(x3, type, term).
prop(x3, role, goal).
prop(x3, relation, object).
prop(x3, proper, false).
prop(x3, pragmatic, null).
prop(x3, num, plural).
prop(x3, modifs, [soft]).
prop(x3, lex, 'duckling').
prop(x3, nav, [N]).
prop(x3, det, indef).

prop(x4, type, term).
prop(x4, role, recipient).
prop(x4, relation, restarg).
prop(x4, proper, false).
prop(x4, pragmatic, null).
prop(x4, num, plural).
prop(x4, modifs, [young]).
prop(x4, lex, 'woman').
prop(x4, nav, [N]).
prop(x4, det, def).

Fig. 5: Prolog representation of the second structure in Fig. 3, which is
generated from the object representation in Fig. 4 and used to create the
linguistic expression The old farmers had given soft ducklings to the young
women in the grammar module (cf. section 4.3)
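
The conversion from an object tree to Prolog facts of this shape can be
sketched in Java as follows. The class PrologExport, its nested Node type
and the method toFacts are hypothetical names, and reading the second
argument of node/2 as the depth of the node in the tree is an assumption
based on the values shown in the figure.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical exporter; not the processing module's actual code. */
final class PrologExport {
    /** Minimal tree node: an identifier, ordered properties and child nodes. */
    static final class Node {
        final String id;
        final Map<String, String> props = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();

        Node(String id) { this.id = id; }

        Node with(String key, String value) { props.put(key, value); return this; }

        Node add(Node child) { children.add(child); return this; }
    }

    /** Emits node/2 and prop/3 facts for a node and its descendants, depth first. */
    static List<String> toFacts(Node node, int depth) {
        List<String> facts = new ArrayList<>();
        // Assumption: the second argument of node/2 is the tree depth.
        facts.add("node(" + node.id + ", " + depth + ").");
        node.props.forEach((key, value) ->
                facts.add("prop(" + node.id + ", " + key + ", " + value + ")."));
        if (!node.children.isEmpty()) {
            List<String> ids = new ArrayList<>();
            for (Node child : node.children) {
                ids.add(child.id);
            }
            facts.add("prop(" + node.id + ", subnodes, ["
                    + String.join(", ", ids) + "]).");
        }
        for (Node child : node.children) {
            facts.addAll(toFacts(child, depth + 1));
        }
        return facts;
    }
}
```

Values are passed through unchanged, so a caller is expected to supply
Prolog-ready terms such as 'give' (with quotes) or [old].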
4.2 Structures in Functional Discourse Grammar

To evaluate the flexibility of our approach, we implemented grammars for
structures on the Representational Level (RL) and the Interpersonal Level
(IL) in Functional Discourse Grammar (FDG, Hengeveld and Mackenzie 2006),
the successor theory of FG. Fig. 6 shows the grammar for structures on the
RL, from which a parser is generated that can parse expressions like the
structure in Fig. 7 into a structure as in Fig. 8.



grammar Representational;

content    : '(' OPERATOR? 'p' X ( ':' head '(' 'p' X ')' )* ')' FUNCTION? ;
soaffairs  : '(' OPERATOR? 'e' X ( ':' head '(' 'e' X ')' )* ')' FUNCTION? ;
property   : '(' OPERATOR? 'f' X ( ':' head '(' 'f' X ')' )* ')' FUNCTION? ;
individual : '(' OPERATOR? 'x' X ( ':' head '(' 'x' X ')' )* ')' FUNCTION? ;
location   : '(' OPERATOR? 'l' X ( ':' head '(' 'l' X ')' )* ')' FUNCTION? ;
time       : '(' OPERATOR? 't' X ( ':' head '(' 't' X ')' )* ')' FUNCTION? ;

head : LEMMA? ( '['
         ( soaffairs
         | property
         | individual
         | location
         | time )* ']' )? ;

FUNCTION : 'Ag'
         | 'Pat'
         | 'Inst' ; //etc.

OPERATOR : 'Past'
         | 'Pres' ; //etc.

LEMMA : 'a'..'z'+ ;

X : '0'..'9'+ ;

Fig. 6: Complete ANTLR v3 grammar for structures on the Representational
Level in Functional Discourse Grammar, which describe nested structures as
in Fig. 7; each head element can take different forms (content, soaffairs,
property, individual, location, time), which themselves contain a head
element again
(p1: [
    (Past e1: [
        (f1: tek [
            (x1: im (x1))Ag
            (x2: naif (x2))Inst
        ] (f1))
        (f2: kot [
            (x1: im (x1))Ag
            (x3: mi (x3))Pat
        ] (f2))
    ] (e1))
] (p1))

Fig. 7: Underlying structure of a serial verb construction in Jamaican
Creole (for im tek naif kot mi, 'He cut me with a knife', Patrick 2004:290)
on the Representational Level in Functional Discourse Grammar, which is
parsable by the parser generated from the rules in Fig. 6. This
representation is based on our analysis of the serial verb construction as
a single event, which can be backed by native speaker intuition and
semantic analysis (Durie 1997:291); an analysis of a serial verb
construction with two events as given in Example 2 of van Staden (2006) can
also be represented using the domain-specific language, while variations in
the formal structure would be recognized as invalid



property
    head
        individual
            head

Fig. 8: Part of the parse tree that the parser generated from the rules in
Fig. 6 produces for the structure in Fig. 7
An ANTLR grammar definition like this provides a validator for the formal
structure of RL representations and can be used with a tool like ANTLRWorks
(http://www.antlr.org/works/) to analyse these representations. Given an
internal representation of the input (cf. Fig. 8), processing other than
the creation of the corresponding linguistic expression (as described for
FG structures in section 4.1) becomes feasible, such as the output of
typeset representations of underlying structures in different formats. This
would allow the representation used for publication (e.g. with subscript
index numbers, with or without indentation, etc.) to be created from the
formal, validated representation.
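
One such alternative output, subscript index numbers, can be sketched in
Java. For brevity this sketch works on the text of an already validated
structure with a regular expression rather than on the parse tree; the
class name Typesetter and the method withSubscripts are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical formatter that renders variable indices as subscripts. */
final class Typesetter {
    // A variable letter (p, e, f, x, l, t) followed directly by index digits
    // and delimited by word boundaries, e.g. the x1 in "(x1: im (x1))".
    private static final Pattern VARIABLE =
            Pattern.compile("\\b([pefxlt])(\\d+)\\b");

    /** Replaces the index digits of each variable with Unicode subscript digits. */
    static String withSubscripts(String structure) {
        Matcher m = VARIABLE.matcher(structure);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(structure, last, m.start()).append(m.group(1));
            for (char digit : m.group(2).toCharArray()) {
                // Subscript digits are contiguous, starting at subscript zero.
                out.append((char) ('\u2080' + (digit - '0')));
            }
            last = m.end();
        }
        return out.append(structure.substring(last)).toString();
    }
}
```

Applied to (x1: im (x1))Ag, this yields (x₁: im (x₁))Ag, while lemmas such
as kot are left untouched because they contain no digits.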



4.3 Lexical Entries 

In the grammar module the Prolog representation of the input generated by
the processing module (cf. Fig. 5) is used to generate a linguistic
expression. Prolog offers convenient notation and processing mechanisms,
e.g. lexical entries can be stored directly as Prolog facts (cf. Fig. 9).
Prolog also has a particularly strong standing as an implementation language
for FG (e.g. Connolly 1986; Samuelsdorff 1989; Dik 1992). By restricting the
usage of Prolog to the grammar module and combining² it with other
languages, instead of using it as a general-purpose programming language for
the entire program, we use Prolog as a DSL in one of its original domains.

The expression rules and the lexicon are based on a revised and extended
version of the implementation described in Samuelsdorff (1989). To make the
implementation work as a module in the described system, the user dialog of
the original version (in which the underlying structure is built step by
step) was replaced by the formal representation that is created in the
input module and converted into a Prolog representation by the processing
module (cf. section 4.1). This resembles the shift to a top-down
organization (Hengeveld and Mackenzie 2006:668) in FDG, where the
conceptualization is the first step, not the selection of lexical elements,
as it was in FG and in the implementation described in Samuelsdorff (1989).

² For calling Prolog from Java we use Interprolog
(http://www.declarativa.com/interprolog/). The Prolog implementation we use
is SWI-Prolog (http://www.swi-prolog.org/).
verb(
    believe,
    state,
    [regular, regular],
    [
        [experiencer, human, X1],
        [goal, proposition, X2]
    ],
    Sat
).

verb(
    give,
    action,
    [gave, given],
    [
        [agent, animate, X1],
        [goal, any, X2],
        [recipient, animate, X3]
    ],
    Sat
).

Fig. 9: Transitive and ditransitive verbs as Prolog facts in the lexicon



5 Conclusion 

We described a modular implementation of a language generation system,
representing underlying structures and lexical entries using domain-specific
languages (DSL). The system makes use of an input format based on Dik
(1997) and consists of modules implemented in Java, Prolog and ANTLR³. As a
first result, this shows that a DSL can be used as a very flexible
linguistic expert front-end to a knowledge base in a different language (as
we have shown in section 4.1 for underlying clause structures based on
Functional Grammar that use a Prolog knowledge base). We believe this is a
promising way of applying domain-specific linguistic knowledge in a natural
language processing system.

As all structures in FG and FDG, as well as the lexical entries (which are
Prolog facts in our system), have a common tree structure, a unified
implementation using ANTLR to define and process all these structures in
the same manner as implemented and described for RL representations is
feasible. So as a second result, this shows that the concept of a DSL is
flexible enough to be applied to newer developments in linguistic theory (as
we have shown for structures on the Representational Level in Functional
Discourse Grammar in section 4.2) as well as to extensions of these (as we
have shown for structures describing lexical entries in section 4.3).
Therefore domain-specific languages can be used as a tool for consistent
formal notation in linguistic description. In our view this encourages the
implementation of a full set of grammars for all the structures a linguist
creates in linguistic description, which could be the core of software
tools that would allow a linguist to create linguistic representations like
a programmer writes code, a mathematician writes formulas or a musician
writes notes: as something that can actually be validated and even executed
in a reproducible manner.

³ ANTLR allows further processing in different target languages including
Java, C, C++, C#, Objective-C, Python and Ruby.